ABSTRACT

One of the most powerful techniques to study protein
structures is to look for recurrent fragments (also called
substructures or spatial motifs), then use them as patterns to
characterize the proteins under study. An emergent trend consists
in parsing proteins' three-dimensional (3D) structures into
graphs of amino acids. Hence, the search for recurrent
spatial motifs is formulated as a process of frequent subgraph
discovery where each subgraph represents a spatial motif. In
this scope, several efficient approaches for frequent subgraph 
discovery have been proposed in the literature. However, the 
set of discovered frequent subgraphs is too large to be effi- 
ciently analyzed and explored in any further process. In this 
paper, we propose a novel pattern selection approach that 
shrinks the large number of discovered frequent subgraphs 
by selecting the representative ones. Existing pattern selec- 
tion approaches do not exploit the domain knowledge. Yet, 
in our approach we incorporate the evolutionary information 
of amino acids defined in the substitution matrices in order 
to select the representative subgraphs. We show the effec- 
tiveness of our approach on a number of real datasets. The 
results issued from our experiments show that our approach 
is able to considerably decrease the number of motifs while 
enhancing their interestingness. 

Categories and Subject Descriptors 

E.4 [Coding and information theory]: [Database Appli- 
cations - Data Mining] 

*An implementation of the proposed approach and
the datasets used in the experiments are freely
available on the first author's personal web page:
http://fc.isima.fr/~dhifli/unsubpatt/ or by email request
to any one of the authors.

†This paper is the full version of a preliminary work accepted
as an abstract in MLCB'12 (NIPS workshop).

1. INTRODUCTION

Studying protein structures can reveal relevant structural
and functional information which may not be derived from
protein sequences alone. During recent years, various methods
that study protein structures have been elaborated based
on diverse types of descriptors such as profiles [25], spatial
motifs [17, 21] and others. Yet, the exponential growth of
online databases such as the Protein Data Bank (PDB) [4],
CATH [7], SCOP [2] and others creates an urgent need for
more accurate methods that will help to better understand
the studied phenomena such as protein evolution, functions,
etc.

In this scope, proteins have recently been interpreted as
graphs of amino acids and studied based on graph theory
concepts [14]. This representation enables the use of graph
mining techniques to study protein structures from a graph
perspective. In fact, in graph mining, any problem or object
under consideration is represented in the form of nodes and
edges and studied based on graph theory concepts [3] [12] [16].
One of the powerful and current trends in graph mining is
frequent subgraph discovery. It aims to discover subgraphs
that frequently occur in a graph dataset and use them as
patterns to describe the data. These patterns are later
analyzed by domain experts to reveal interesting information
hidden in the original graphs, such as discovering pathways
in metabolic networks [9], identifying residues that play the
role of hubs in the protein and stabilize its structure [24],
etc.

The graph isomorphism test is one of the main bottlenecks
of frequent subgraph mining. Yet, many efficient and
scalable algorithms have been proposed in the literature and
made it feasible, for instance FFSM [15], gSpan [29],
GASTON [18], etc. Unfortunately, the exponential number of
discovered frequent subgraphs is another serious issue that
still needs more attention [27], since it may hinder or even
make any further analysis unfeasible due to time, resources,
and computational limitations. For example, in an AIDS
antiviral screen dataset composed of 422 chemical compounds,
there are more than 1 million frequent substructures when
the minimum support is 5%. This problem becomes even
more serious with graphs of higher density such as those
representing protein structures. In fact, the issues raised
by the huge number of frequent subgraphs are mainly due
to two factors, namely redundancy and significance [22].
Redundancy in a frequent subgraph set is caused by structural
and/or semantic similarity, since most discovered subgraphs
differ slightly in structure and may convey similar or even the
same meaning. Moreover, the significance of the discovered
frequent subgraphs is only related to frequency. This yields
an urgent need for efficient approaches that select relevant
patterns among the large set of frequent subgraphs.

In this paper, we propose a novel selection approach which
selects a subset of representative patterns from a set of
labeled subgraphs; we term them unsubstituted patterns. In
order to select these unsubstituted patterns and to shrink
the large size of the initial set of frequent subgraphs, we
exploit a specific domain knowledge, which is the substitution
between amino acids represented as nodes. Thus,
the main contribution of this work is to define a new
approach for mining a representative summary of the set of
frequent subgraphs by incorporating a specific background
domain knowledge, which is the ability of substitution
between node labels in the graph. In this work, we apply
the proposed approach on protein structures because of the
availability of substitution matrices in the literature.
However, it can be considered as a general framework for other
applications whenever it is possible to define a matrix
quantifying the possible substitution between the labels. Our
approach can also be used on any type of subgraph
structure such as cliques, trees and paths (sequences). In
addition, it can be easily coupled with other pattern selection
methods such as discrimination or orthogonality based
approaches. Moreover, this approach is unsupervised and can
help in various mining tasks, unlike other approaches that
are supervised and dedicated to a specific task such as
classification.

The remainder of the paper is organized as follows.
Section 2 discusses the recent related works in the area of
pattern selection for subgraphs. In Section 3, we present the
background of our work and we define the preliminary
concepts as well as the main algorithm of our approach. Then,
Section 4 describes the characteristics of the used data and
the experimental settings. Section 5 presents the obtained
experimental results and the discussion. It is worth noting
that in the rest of the paper, we use the following terms
interchangeably: spatial motifs, patterns, subgraphs.

2. RELATED WORKS 

Recently, several approaches have been proposed for
pattern selection in subgraph mining. In [5], the authors proposed
ORIGAMI, an approach for both subgraph discovery and
selection. First they randomly mine a sample of maximal
frequent subgraphs, then they select an α-orthogonal
(non-redundant), β-representative set of subgraphs
from the mined set. The LEAP algorithm proposed in [28]
tries to locate patterns that individually have high
discrimination power, using an objective function score that
measures each pattern's significance. Another approach termed
gPLS was proposed in [20]. It attempts to select a set of
informative subgraphs in order to rapidly build a classifier.
gPLS uses the mathematical concept of partial least squares
regression to create latent variables allowing a better
prediction. COM [16] is another subgraph selection approach
which follows a process of pattern mining and classifier
learning. It mines co-occurrence rules. Then, based on the
co-occurrence information, it assembles weak features in order
to generate strong ones. In [22], the authors proposed a
feature selection approach termed CORK. To find frequent
subgraphs, it uses the state-of-the-art approach gSpan. Then,
using a submodular quality function, it selects among them
the subset of subgraphs that are most discriminative for
classification. In [10], the authors designed LPGBCMP, a
general model which selects clustered features by considering
the structure relationship between subgraph patterns in the
functional space. The selected subgraphs are used as weak
classifiers (base learners) to obtain high quality
classification models. To the best of our knowledge, in all existing
subgraph selection approaches [11], the selection is usually
based on structural similarity ^ and/or statistical measures 
(e.g., frequency and coverage (closed [30], maximal [23]),
discrimination [22], ...). Yet, the prior information and
knowledge about the domain are often ignored, although this
prior knowledge may help build dedicated approaches
that best fit the studied data.



3. MINING REPRESENTATIVE UNSUBSTITUTED PATTERNS

3.1 Background 

Statistical pattern selection methods have been widely
used to resolve the dimensionality problem when the number
of discovered patterns is too large. However, these methods
are too generic and do not consider the specificity of the
domain and the used data. We believe that in many
contexts, it would be important to incorporate the background
knowledge about the domain in order to create approaches
that best fit the considered data. In proteomics, a protein
structure is formed by the folding of a set of amino acids.
During evolution, amino acids can substitute each other.
The scores of substitution between pairs of amino acids were
quantified by biologists in the literature in the form of
substitution matrices such as Blosum62 [8]. Our approach uses
the substitution information given in the substitution
matrices in order to select a subset of unsubstituted patterns
that summarizes the whole set of frequent subgraphs. We
consider the selected patterns as the representative ones.

The main idea of our approach is based on node
substitution. Since the nodes of a protein graph represent amino
acids, a substitution matrix makes it possible to quantify
the substitution between two given subgraphs. Starting from
this idea, we define a similarity function that measures how
similar a given pair of subgraphs is. Then, we preserve only
one subgraph from each pair having a similarity score greater
than or equal to a user-specified threshold, such that the
preserved subgraphs form the set of representative
unsubstituted patterns. An overview of the proposed approach
is illustrated in Figure 1 and a more detailed description is
given in the following sections.

The substitution between amino acids was also used in
the literature, but for sequential feature extraction from
protein sequences in [19], where the authors proposed a novel
feature extraction approach termed DDSM for protein
sequence classification. Their approach is restricted to protein
sequences and generates every subsequence substituting
another one. In other words, DDSM eliminates any pattern
substituted by another one and which itself does not
substitute any other one. We believe that their approach does
not guarantee an optimal summarization since its output
may still contain patterns that substitute each other. In
addition, they consider negative substitution scores as
impossible substitutions, which is biologically not true since
negative scores only express the less likely substitutions,
which does not mean that they are impossible. Moreover,
DDSM is limited to protein sequences and does not handle more
complex structures such as the protein tertiary structure.
Our approach overcomes these shortcomings, since it can handle
both protein sequences and structures (since a sequence can be
seen as a line graph). In addition, it considers both the
positive and negative scores of the matrix. Moreover, our
approach generates a set of representative unsubstituted
patterns ensuring an optimal summarization of the initial set.
Besides, it is unsupervised and can be exploited in
classification as well as other analysis and knowledge
discovery contexts, unlike DDSM which is dedicated to
classification.

Figure 1: Unsubstituted pattern selection framework.

3.2 Preliminaries 

In this section we present the fundamental definitions and
the formal problem statement. Let $\mathcal{G}$ be a dataset of
graphs. Each graph $G = (V, E, L)$ of $\mathcal{G}$ is given as a
collection of nodes $V$ and a collection of edges $E$. The nodes
of $V$ are labeled within an alphabet $L$. We denote by $|V|$ the
number of nodes (also called the graph order) and by $|E|$ the
number of edges (also called the graph size). Let also $\Omega$
be the set of frequent subgraphs extracted from $\mathcal{G}$,
also referred to here as patterns.

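Definition 1. (Substitution matrix) Given an alphabet $L$,
a substitution matrix $\mathcal{M}$ over $L$ is the function
defined as:

$$\mathcal{M} : L^2 \to [\bot, \top] \subset \mathbb{R}, \quad (l, l') \mapsto x \tag{1}$$

The higher the value of $x$ is, the more likely is the
substitution of $l'$ by $l$. If $x = \bot$ then the substitution
is impossible, and if $x = \top$ then the substitution is
certain. The values $\bot$ and $\top$ are optional and
user-specified. They may appear or not in $\mathcal{M}$. The
scores in $\mathcal{M}$ should respect the following two
properties:

1. $\forall l \in L$, $\exists l' \in L \mid \mathcal{M}(l, l') \neq \bot$,

2. $\forall l \in L$, if $\exists l' \in L \mid \mathcal{M}(l, l') = \top$ then $\forall l'' \in L \setminus \{l, l'\}$,
$\mathcal{M}(l, l'') = \bot$ and $\mathcal{M}(l', l'') = \bot$.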

In many real world applications, the substitution matrices
may contain at the same time positive and negative scores.
In the case of protein substitution matrices, both positive
and negative values represent possible substitutions.
However, positive scores are given to the more likely
substitutions while negative scores are given to the less likely
substitutions. Thus, in order to give more magnitude to
higher values of $x$, $\forall l$ and $l' \in L$:

$$\mathcal{M}(l, l') = e^{\mathcal{M}(l, l')} \tag{2}$$

Definition 2. (Structural isomorphism) Two patterns
$P = (V_P, E_P, L)$ and $P' = (V_{P'}, E_{P'}, L)$ are said to be
structurally isomorphic (having the same shape), noted
$shape(P, P') = true$, iff:

- $P$ and $P'$ have the same order, i.e., $|V_P| = |V_{P'}|$,

- $P$ and $P'$ have the same size, i.e., $|E_P| = |E_{P'}|$,

- $\exists$ a bijective function $f : V_P \to V_{P'} \mid \forall u, v \in V_P$, if
$(u, v) \in E_P$ then $(f(u), f(v)) \in E_{P'}$ and inversely.

It is worth mentioning that in the graph isomorphism 
problem we test whether two graphs are exactly the same by 
considering both structures and labels. But in this defini- 
tion, we only test whether two given graphs are structurally 
the same, in terms of nodes and edges, without considering 
the labels. 
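
To make Definition 2 concrete, the following is a minimal sketch of a structural isomorphism test that ignores labels. It is not the authors' implementation: the function name `same_shape` and the graph encoding (a list of node ids plus a list of undirected edges) are assumptions chosen for the example, and the brute-force search over all bijections is only practical for very small patterns.

```python
from itertools import permutations


def same_shape(nodes_p, edges_p, nodes_q, edges_q):
    """Return True if two unlabeled, undirected graphs are structurally isomorphic.

    Brute force over all node bijections: only usable for very small patterns.
    """
    if len(nodes_p) != len(nodes_q):          # same order
        return False
    ep = {frozenset(e) for e in edges_p}
    eq = {frozenset(e) for e in edges_q}
    if len(ep) != len(eq):                    # same size
        return False
    for image in permutations(nodes_q):
        f = dict(zip(nodes_p, image))         # candidate bijection f: V_P -> V_P'
        mapped = {frozenset(f[u] for u in e) for e in ep}
        if mapped == eq:                      # edges preserved in both directions
            return True
    return False
```

Because both edge sets have the same cardinality and `f` is a bijection, comparing the mapped edge set with the target edge set covers the "and inversely" part of the definition.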

Definition 3. (Elementary mutation probability) Given a
node $v$ of a label $l \in L$, the elementary mutation
probability, $M_{el}(v)$, measures the possibility that $v$ stays
itself and does not mutate to any other node depending on its
label $l$.

$$M_{el}(v) = \begin{cases} 0, & \text{if } \mathcal{M}(l, l) = e^{\bot} \\ 1, & \text{if } \mathcal{M}(l, l) = e^{\top} \\ \dfrac{\mathcal{M}(l, l)}{\sum_{l' \in L} \mathcal{M}(l, l')}, & \text{otherwise} \end{cases} \tag{3}$$

Obviously, if the substitution score in $\mathcal{M}$ between $l$
and itself is $\bot$ then it is certain that $l$ will mutate to
another label $l'$ and the probability value that $v$ does not
mutate should be 0. Respectively, if this substitution score is
$\top$ then it is certain that $v$ will stay itself and conserve
its label $l$ so the probability value must be equal to 1.
Otherwise, we divide the score that $l$ mutates to itself by the
sum of all the possible mutations.
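
As an illustration of Definition 3, the sketch below computes the elementary mutation probability on a toy three-letter alphabet. The raw scores are made up for the example (they are not Blosum62 values), and the names `RAW`, `ALPHABET`, `exponentiate` and `elementary_mutation_probability`, as well as passing the optional bounds as `bottom` and `top` arguments, are assumptions of this sketch rather than the authors' code.

```python
import math

# Hypothetical raw scores over a toy alphabet (illustrative values only).
ALPHABET = ["A", "G", "W"]
RAW = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "A"): 0, ("G", "G"): 6, ("G", "W"): -2,
    ("W", "A"): -3, ("W", "G"): -2, ("W", "W"): 11,
}


def exponentiate(raw):
    """Map every raw score x to e**x, so that all scores become positive."""
    return {pair: math.exp(score) for pair, score in raw.items()}


def elementary_mutation_probability(label, m, alphabet, bottom=None, top=None):
    """Probability that a node with this label keeps its label.

    m holds exponentiated scores; bottom/top are the optional raw bounds.
    """
    self_score = m[(label, label)]
    if bottom is not None and self_score == math.exp(bottom):
        return 0.0   # self-substitution impossible: the label must change
    if top is not None and self_score == math.exp(top):
        return 1.0   # self-substitution certain: the label never changes
    return self_score / sum(m[(label, other)] for other in alphabet)
```

With exponentiated scores every entry is positive, so the ratio in the general case always lies strictly between 0 and 1.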

Definition 4. (Pattern mutation probability) Given a
pattern $P = (V_P, E_P, L) \in \Omega$, the pattern mutation
probability, $M_{patt}(P)$, measures the possibility that $P$
mutates to any other pattern having the same order.

$$M_{patt}(P) = 1 - \prod_{i=1}^{|V_P|} M_{el}(P[i]) \tag{4}$$

where $\prod_{i=1}^{|V_P|} M_{el}(P[i])$ represents the probability
that the pattern $P$ does not mutate to any other pattern, i.e.,
$P$ stays itself.
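
The pattern mutation probability can be sketched in a few lines once the per-node probabilities are known. The function name `pattern_mutation_probability` and its input (a list holding one elementary mutation probability per node) are assumptions made for this example.

```python
from math import prod


def pattern_mutation_probability(node_stay_probs):
    """One minus the probability that every node of the pattern keeps its label.

    node_stay_probs: one elementary mutation probability per node of the pattern.
    """
    return 1.0 - prod(node_stay_probs)
```

For instance, a two-node pattern whose nodes each keep their label with probability 0.5 stays itself with probability 0.25, hence mutates with probability 0.75; a pattern whose nodes all have probability 1 yields 0.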

Definition 5. (Elementary substitution probability) Given
two nodes $v$ and $v'$ having correspondingly the labels
$l, l' \in L$, the elementary substitution probability,
$S_{el}(v, v')$, measures the possibility that $v$ substitutes
$v'$.

$$S_{el}(v, v') = \frac{\mathcal{M}(l, l')}{\mathcal{M}(l, l)} \tag{5}$$

It is worth noting that $S_{el}$ is not symmetric, i.e., in
general $S_{el}(v, v') \neq S_{el}(v', v)$.



Definition 6. (Pattern substitution score) Given two
patterns $P = (V_P, E_P, L)$ and $P' = (V_{P'}, E_{P'}, L)$ having
the same shape, we denote by $S_{patt}(P, P')$ the substitution
score of $P'$ by $P$. In other words, it measures the possibility
that $P$ mutates to $P'$ by computing the sum of the elementary
substitution probabilities and then normalizing it by the total
number of nodes of $P$. Formally:

$$S_{patt}(P, P') = \frac{\sum_{i=1}^{|V_P|} S_{el}(P[i], P'[i])}{|V_P|} \tag{6}$$

Definition 7. (Pattern substitution) A pattern $P$
substitutes a pattern $P'$, noted $subst(P, P', \tau) = true$, iff:

1. $P$ and $P'$ are structurally isomorphic ($shape(P, P') = true$),

2. $S_{patt}(P, P') \geq \tau$, where $\tau$ is a user-specified
threshold such that $0\% \leq \tau \leq 100\%$.
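
The substitution score and the thresholded substitution test can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the scores in `RAW` are made up, the two label lists are assumed to be already aligned by the structural bijection (so the shape test has been done beforehand), and `tau` is given as a fraction in [0, 1] rather than a percentage.

```python
import math

# Hypothetical raw scores over a toy alphabet, exponentiated so that all are positive.
RAW = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "A"): 0, ("G", "G"): 6, ("G", "W"): -2,
    ("W", "A"): -3, ("W", "G"): -2, ("W", "W"): 11,
}
M = {pair: math.exp(score) for pair, score in RAW.items()}


def elementary_substitution(label, other, m):
    """Score of `label` substituting `other`, relative to `label` keeping itself."""
    return m[(label, other)] / m[(label, label)]


def pattern_substitution_score(labels_p, labels_q, m):
    """Average elementary substitution score over node pairs aligned by position."""
    if len(labels_p) != len(labels_q):
        raise ValueError("patterns must have the same order")
    total = sum(elementary_substitution(a, b, m) for a, b in zip(labels_p, labels_q))
    return total / len(labels_p)


def substitutes(labels_p, labels_q, m, tau):
    """True if the substitution score reaches the threshold tau (a fraction in [0, 1])."""
    return pattern_substitution_score(labels_p, labels_q, m) >= tau
```

A pattern compared with itself scores exactly 1, and the elementary score is generally asymmetric because it is normalized by the self-score of the substituting label.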

Definition 8. (Unsubstituted pattern) Given a threshold
$\tau$ and $\Omega^* \subseteq \Omega$, a pattern $P^* \in \Omega^*$ is said to be
unsubstituted iff $\nexists P \in \Omega^* \mid M_{patt}(P) > M_{patt}(P^*)$ and
$subst(P, P^*, \tau) = true$.

Proposition 1 (Null $M_{patt}$ case). Given a pattern
$P = (V_P, E_P, L) \in \Omega$, if $M_{patt}(P) = 0$ then $P$ is an
unsubstituted pattern.

Proof. The proof can be simply deduced from Definitions 3
and 1. □

Definition 9. (Merge support) Given two patterns $P$ and
$P'$, if $P$ substitutes $P'$ then $P$ will represent $P'$ in the
list of graphs where $P'$ occurs. Formally:

$$\forall (P, P') \mid subst(P, P', \tau) = true \text{ then } D_P = D_P \cup D_{P'} \tag{7}$$

where $D_P$ and $D_{P'}$ are correspondingly the occurrence set
of $P$ and that of $P'$.

3.3 Algorithm 

Given a set of patterns $\Omega$ and a substitution matrix
$\mathcal{M}$, we propose UnSubPatt (see Algorithm 1), a pattern
selection algorithm which enables detecting the set of
unsubstituted patterns $\Omega^*$ within $\Omega$. Based on our
similarity concept, all the patterns in $\Omega^*$ are dissimilar,
since it does not contain any pair of patterns that are
substitutable. This represents a reliable summarization of
$\Omega$.

The general process of the algorithm is described as
follows: first, $\Omega$ is divided into subsets of patterns
having the same number of nodes and edges. Then, each subset is
sorted in descending order by the pattern mutation probability
$M_{patt}$. Each subset is browsed starting from the pattern
having the highest $M_{patt}$. For each pattern, we remove all
the patterns it substitutes and we merge their supports such
that the preserved pattern will represent all the removed
ones wherever they occur. The remaining patterns represent
the unsubstituted pattern set. Thus, $\Omega^*$ cannot be
summarized by any of its subsets other than itself. Our
algorithm uses Proposition 1 to avoid unnecessary computation
related to patterns with $M_{patt} = 0$. They are directly
considered as unsubstituted patterns, since they cannot mutate
to any other pattern.



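Algorithm 1: UnSubPatt

Data: $\Omega$, $\mathcal{M}$, $\tau$, $(\bot, \top)$ [Optional]
Result: $\Omega^*$: {unsubstituted patterns}
begin
    $\Omega \leftarrow \bigcup \Omega^k = \{P \mid \forall (P', P'') \in \Omega^k, |V_{P'}| = |V_{P''}| \text{ and } |E_{P'}| = |E_{P''}|\}$;
    foreach $\Omega^k \in \Omega$ do
        $\Omega^k \leftarrow sort(\Omega^k \text{ by } M_{patt})$;
        foreach $P \in \Omega^k$ do
            if $M_{patt}(P) > 0$ then
                foreach $P' \in \Omega^k \setminus P \mid M_{patt}(P') < M_{patt}(P)$ do
                    if $M_{patt}(P') > 0$ then
                        if $shape(P, P')$ and $subst(P, P', \tau)$ then
                            $merge\_support(P, P')$;
                            $remove(P', \Omega^k)$;
        $\Omega^* \leftarrow \Omega^* \cup \Omega^k$;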
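
The selection loop can be sketched in Python as follows. This is a simplified illustration and not the authors' released implementation: the pattern encoding (a dict with `order`, `size` and `support`), the precomputed `mpatt` values, and the two callables `same_shape` and `substitutes` (with the threshold already applied) are all assumptions of this sketch.

```python
from collections import defaultdict


def unsubpatt(patterns, mpatt, same_shape, substitutes):
    """Select unsubstituted patterns and merge the supports of removed ones.

    patterns:    dict name -> {"order": int, "size": int, "support": set}
    mpatt:       dict name -> pattern mutation probability
    same_shape:  callable (name, name) -> bool, structural isomorphism test
    substitutes: callable (name, name) -> bool, threshold already applied
    Returns a dict name -> merged support for the preserved patterns.
    """
    # Group patterns by (number of nodes, number of edges).
    groups = defaultdict(list)
    for name, info in patterns.items():
        groups[(info["order"], info["size"])].append(name)

    selected = {}
    for names in groups.values():
        # Browse each group from the highest mutation probability downwards.
        names.sort(key=lambda n: mpatt[n], reverse=True)
        removed = set()
        for i, p in enumerate(names):
            if p in removed:
                continue
            support = set(patterns[p]["support"])
            if mpatt[p] > 0:  # patterns with zero probability are kept untouched
                for q in names[i + 1:]:
                    if q in removed or not 0 < mpatt[q] < mpatt[p]:
                        continue
                    if same_shape(p, q) and substitutes(p, q):
                        support |= patterns[q]["support"]  # merge supports
                        removed.add(q)
            selected[p] = support
    return selected
```

Running the function a second time on its own output, with the same callables, returns the same set of patterns, which mirrors the fixed-point behaviour expected of the selected set.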



Theorem 1. Let $\Omega$ be a set of patterns and $\Omega^*$ its
subset of unsubstituted patterns based on a substitution matrix
$\mathcal{M}$ and a threshold $\tau$, i.e.,
UnSubPatt$(\Omega, \mathcal{M}, \tau, (\bot, \top)) = \Omega^*$.
Then:

$$\text{UnSubPatt}(\Omega^*, \mathcal{M}, \tau, (\bot, \top)) = \Omega^* \tag{8}$$

Proof. The proof can be deduced simply from Definition 8.
Given a threshold $\tau$, $\Omega^*$ cannot be summarized by any
of its subsets other than itself. Formally:

$$\forall P \in \Omega^*, \nexists P' \in \Omega^* \mid M_{patt}(P) > M_{patt}(P') \text{ and } subst(P, P', \tau) \tag{9}$$

□

3.4 Complexity 

Suppose $\Omega$ contains $n$ patterns. $\Omega$ is divided into
$g$ groups, each containing patterns of order $k$. This is done
in $O(n)$. Each group $\Omega^k$ is sorted in
$O(|\Omega^k| \log |\Omega^k|)$. Searching for unsubstituted
patterns requires browsing $\Omega^k$ ($O(|\Omega^k|)$) and, for
each pattern, browsing in the worst case all remaining patterns
($O(|\Omega^k|)$) to check the shape ($O(k)$) and the
substitution ($O(k)$). This means that searching for
unsubstituted patterns in a group $\Omega^k$ can be done in
$O(|\Omega^k|^2 \cdot k^2)$. Hence, in the worst case, the
complexity of our algorithm is
$O(g \cdot m_{max}^2 \cdot k_{max}^2)$, where $k_{max}$ is the
maximum pattern order and $m_{max}$ is the number of patterns of
the largest group.

4. EXPERIMENTS 
4.1 Datasets 

In order to experimentally evaluate our approach, we use
four graph datasets of protein structures, which have also
been used in [28] and then in [10]. Each dataset consists of two
classes: positive and negative. Positive samples are proteins
selected from a considered protein family whereas negative
samples are proteins randomly gathered from the Protein
Data Bank [4]. Each protein is parsed into a graph of amino
acids. Each node represents an amino acid residue and is
labeled with its amino acid type. Two nodes $u$ and $v$ are
linked by an edge $e(u, v) = 1$ if the Euclidean distance between
their two $C_\alpha$ atoms $\Delta(C_\alpha(u), C_\alpha(v))$ is
below a threshold distance $\delta$. Formally:

$$e(u, v) = \begin{cases} 1, & \text{if } \Delta(C_\alpha(u), C_\alpha(v)) < \delta \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

In the literature, many methods use this definition with
usually $\delta \geq 7$ Å on the argument that $C_\alpha$ atoms
define the overall shape of the protein conformation [13]. In
our experiments, we use $\delta = 7$ Å.
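
The graph construction described above can be sketched as follows. The function name `contact_graph` and the input format (a list of `(label, (x, y, z))` pairs holding the amino acid type and the coordinates of its alpha carbon, in Å) are assumptions made for this example; reading the coordinates from a PDB file is left out.

```python
import math
from itertools import combinations


def contact_graph(residues, delta=7.0):
    """Build a labeled contact graph from alpha-carbon coordinates.

    residues: list of (amino_acid_label, (x, y, z)) tuples, coordinates in angstroms.
    Returns (labels, edges) where edges is a set of index pairs (i, j) with i < j.
    """
    labels = [label for label, _ in residues]
    edges = set()
    for i, j in combinations(range(len(residues)), 2):
        # Link two residues when their alpha carbons are closer than delta.
        if math.dist(residues[i][1], residues[j][1]) < delta:
            edges.add((i, j))
    return labels, edges
```

The pairwise loop is quadratic in the number of residues, which is acceptable for single proteins of a few hundred residues such as those in the datasets used here.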

Table 1 summarizes the characteristics of each dataset.
SCOP ID, Avg.|V|, Avg.|E|, Max.|V| and Max.|E| correspond
respectively to the id of the protein family in the SCOP
database [2], the average number of nodes, the average number
of edges, the maximal number of nodes and the maximal
number of edges in each dataset.

4.2 Protocol and Settings 

Generally, in a pattern selection approach two aspects are
emphasized, namely the number of selected patterns and
their interestingness. In order to evaluate our approach,
we first use the state-of-the-art method of frequent
subgraph discovery gSpan [29] to find the frequent subgraphs in
each dataset with a minimum frequency threshold of 30%.
Then, we use UnSubPatt to select the unsubstituted patterns
among them with a minimum substitution threshold
$\tau$ = 30%. For our approach, we use Blosum62 [8] as the
substitution matrix since it performs well on detecting the
majority of weak protein similarities, and it is used as the
default matrix by most biological applications such as
BLAST [1]. It is worth mentioning that 30% was chosen as the
minimum frequency threshold for the frequent subgraph
extraction to keep the experimental evaluation feasible given
time and computational limitations.

In order to evaluate the number of selected subgraphs, we
define the selection rate as the percentage of unsubstituted
subgraphs among the initial set of frequent subgraphs.
Formally:

$$\text{Selection rate} = \frac{|\Omega^*| \times 100}{|\Omega|} \tag{11}$$
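
As a quick check of this definition, the short sketch below recomputes the selection rate from the subgraph counts reported in Table 2. The function name `selection_rate` is an assumption made for the example.

```python
def selection_rate(n_unsubstituted, n_frequent):
    """Percentage of unsubstituted subgraphs among the frequent subgraphs."""
    return 100.0 * n_unsubstituted / n_frequent


# Counts reported in Table 2: (frequent subgraphs, unsubstituted subgraphs).
counts = {
    "DS1": (799094, 7291),
    "DS2": (258371, 15898),
    "DS3": (114792, 14713),
    "DS4": (1073393, 9958),
}
rates = {name: round(selection_rate(sel, freq), 2) for name, (freq, sel) in counts.items()}
```

The rounded values agree with the selection rates listed in Table 2 (0.91, 6.15, 12.82 and 0.93).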

To evaluate the interestingness of the set of selected
patterns, we use them as features for classification. We
perform a 5-fold cross-validation classification (5 runs) on each
protein-structure dataset. We encode each protein into a
binary vector, denoting by "1" or "0" the presence or the
absence of the feature in the considered protein. To judge the
interestingness of the selected subgraphs, we use one of the
best-known classifiers, namely the naive Bayes (NB)
classifier, due to its simplicity and fast prediction, and
because its classification technique is based on a global and
conditional evaluation of the input features. NB is used with
the default parameters from the Weka workbench [26].
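
The binary encoding step can be sketched as follows. The function name `encode_presence` and the input format (a dict mapping each selected pattern to the set of protein ids in which it occurs) are assumptions of this example; the resulting vectors would then be passed to the classifier.

```python
def encode_presence(occurrences, protein_ids):
    """Encode each protein as a 0/1 vector over the selected patterns.

    occurrences: dict pattern name -> set of protein ids containing the pattern.
    protein_ids: iterable of protein ids to encode.
    Returns (features, vectors): the ordered feature names and a dict id -> vector.
    """
    features = sorted(occurrences)  # fixed feature order shared by all vectors
    vectors = {
        pid: [1 if pid in occurrences[f] else 0 for f in features]
        for pid in protein_ids
    }
    return features, vectors
```

Sorting the pattern names fixes one column order for every protein, so the vectors can be stacked directly into a feature matrix.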

5. RESULTS AND DISCUSSION 

In this section, we conduct experiments to examine the 
effectiveness and efficiency of UnSubPatt in finding the 
representative unsubstituted subgraphs. We test the effect 
of changing the substitution matrix and the substitution 
threshold on the results. Moreover, we study the size-based 
distribution of patterns and we compare the results of our 
approach with those of other subgraph selection methods 
from the literature. 



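Table 2: Number of frequent subgraphs ($\Omega$), representative
unsubstituted subgraphs ($\Omega^*$) and the selection rate

Dataset   $|\Omega|$   $|\Omega^*|$   Selection rate (%)
DS1       799094       7291           0.91
DS2       258371       15898          6.15
DS3       114792       14713          12.82
DS4       1073393      9958           0.93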



5.1 Empirical Results 

Here, we show the results of our experiments obtained
in terms of number of motifs and classification results. As
mentioned before, we use gSpan to extract the frequent
subgraphs from each dataset with frequency ≥ 30%. Then, we
use UnSubPatt to select the unsubstituted patterns among
them with a substitution threshold $\tau$ = 30% and using
Blosum62 as substitution matrix. Finally, we perform a 5-fold
cross-validation classification (5 runs) to evaluate the
interestingness of each subset using the NB classifier. The
obtained average results are reported in Table 3.

The high number of discovered frequent subgraphs is due
to their combinatorial nature (this was discussed in the
introductory section). The results reported in Table 2 show
that our approach decreases considerably the number of
subgraphs. The selection rate shows that the number of
unsubstituted patterns $|\Omega^*|$ does not exceed 13% of the
initial set of frequent subgraphs $|\Omega|$ with DS3 and even
reaches less than 1% with DS1 and DS4. This suggests that
exploiting the domain knowledge, which in this case consists of
the substitution matrix, enables emphasizing information that
can possibly be ignored when using existing subgraph selection
approaches.

The classification results reported in Table 3 help to
evaluate the interestingness of the selected patterns. Indeed,
they indicate whether the unsubstituted patterns were
arbitrarily selected or are really representative. Table 3 shows
that the classification accuracy significantly increases with
all datasets. We notice a huge leap in accuracy especially
with DS1 and DS4, with a gain of more than 17% and
reaching almost full accuracy with DS4. To better understand the
accuracy results, we also report the average precision,
recall, F-measure and AUC values for all cases. We notice an
enhancement of performance with all the mentioned
quality metrics. This supports the reliability of our selection
approach.

5.2 Results Using Other Substitution Matrices

Besides Blosum62, biologists also defined other
substitution matrices describing the likelihood that two amino acid
types would mutate to each other in evolutionary time. We
want to study the effect of using other substitution matrices
on the experimental results. Thus, we perform the same
experiments following the same protocol and settings but
using two other substitution matrices, namely Blosum80 and
Pam250. We compare the obtained results in terms of
number of subgraphs and classification accuracy with those
obtained using the whole set of frequent subgraphs and those
using subgraphs selected by UnSubPatt with Blosum62.
The results are reported in Table 4. A high selection rate
accompanied by a clear enhancement of the classification


Table 1: Experimental data

Dataset  SCOP ID  Family name                 Pos.  Neg.  Avg.|V|  Avg.|E|  Max.|V|  Max.|E|
DS1      52592    G proteins                  33    33    246      971      897      3544
DS2      48942    C1 set domains              38    38    238      928      768      2962
DS3      56437    C-type lectin domains       38    38    185      719      755      3016
DS4      88854    Kinases, catalytic subunit  41    41    275      1077     775      3016



Table 3: Accuracy, precision, recall (sensitivity), F-score and AUC of the classification of each dataset using
NB coupled with frequent subgraphs (FSg) then representative unsubstituted subgraphs (UnSubPatt)

         Accuracy         Precision        Recall           F-score          AUC
Dataset  FSg   UnSubPatt  FSg   UnSubPatt  FSg   UnSubPatt  FSg   UnSubPatt  FSg   UnSubPatt
DS1      0.62  0.78       0.61  0.69       0.70  0.90       0.64  0.78       0.64  0.78
DS2      0.80  0.90       0.86  0.94       0.74  0.86       0.79  0.89       0.79  0.89
DS3

accuracy is noticed using UnSubPatt with all the
substitution matrices compared to the results using the whole set of
frequent subgraphs. Even with different substitution
matrices, UnSubPatt shows relatively similar behavior and is
able to select a small yet relevant subset of patterns. It is
also worth mentioning that for all the datasets, the best
classification accuracy is obtained using Blosum62 and the
best selection rate is achieved using Pam250. This is simply
due to how distant proteins within the same dataset are, since
each substitution matrix was constructed to implicitly express
a particular theory of evolution. Thus, choosing the
appropriate substitution matrix can influence the outcome of the
analysis.

5.3 Impact of Substitution Threshold 

In our experiments, we used a substitution threshold of
30% to select the unsubstituted patterns from the set of
discovered frequent subgraphs. In this section, we study the
impact of varying the substitution threshold on both the
number of selected subgraphs and the classification results.
To do so, we perform the same experiments while varying
the substitution threshold from 0% to 90% with a step size
of 10%. In order to check whether the enhancements of the
obtained results are due to our selected features or to the
classifier, we use two other well-known classifiers, namely the
support vector machine (SVM) and decision tree (C4.5), besides
the naive Bayes (NB) classifier. We use the same protocol and
settings as in the previous experiments. Figure 2 presents the
selection rate for different substitution thresholds and
Figures 3, 4 and 5 illustrate the classification accuracy obtained
respectively using NB, SVM and C4.5 with each dataset.
The classification accuracy of the initial set of frequent
subgraphs (gSpan, the line in red) is considered as a standard
value for comparison. Thus, the accuracy values of
UnSubPatt (in blue) that are above the line of the standard value
are considered as gains, and those under the standard value
are considered as losses.

In Figure 2, we notice that UnSubPatt reduces
considerably the number of subgraphs, especially with lower
substitution thresholds. In fact, the number of unsubstituted
patterns does not exceed 30% for all substitution thresholds
below 70% and even reaches less than 1% in some cases.
This important reduction in the number of patterns comes
with a notable enhancement of the classification accuracies.
This fact is illustrated in Figures 3, 4 and 5, which show
that the unsubstituted patterns allow better classification
performance compared to the original set of frequent
subgraphs. UnSubPatt scores very well with the three used
classifiers and even reaches full accuracy in some cases. This
supports our assumptions and shows that our selection is
reliable and contributes to the enhancement of the accuracy.
However, we believe that NB allows the most reliable
evaluation because it performs a classification based on a global
and conditional evaluation of features, unlike SVM which
performs itself another attribute selection to select the
support vectors and unlike C4.5 which performs an attribute by
attribute evaluation.

Figure 2: Rate of unsubstituted patterns from $\Omega$
depending on the substitution threshold ($\tau$). (x-axis:
substitution threshold (%))

In the case of proteins, a substitution threshold of 0% 
amounts to selecting subgraphs based only on their structure. 
More precisely, UnSubPatt selects only one pattern from each 
group of structurally isomorphic subgraphs. Based on the 
experimental results, we believe that these patterns are 
sufficient for a structural classification task, since this 
setting allows a fast selection, yields a very small number 
of subgraphs and performs very well in classification. 
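
The structure-only selection described above can be sketched as
follows. This is a minimal illustration, not the UnSubPatt
implementation. It assumes that each pattern is given as a list
of node labels plus a list of undirected edges over the node
indices 0..n-1, and it relies on a brute-force canonical form
(the hypothetical helper structure_key) that is practical only
for small patterns.

```python
from itertools import permutations


def structure_key(n_nodes, edges):
    """Canonical form of an unlabeled undirected graph (brute force)."""
    best = None
    for perm in permutations(range(n_nodes)):
        # Relabel every edge under this node permutation and normalize it.
        relabeled = tuple(sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges
        ))
        if best is None or relabeled < best:
            best = relabeled
    return (n_nodes, best)


def one_per_structure(patterns):
    """Keep the first pattern seen for each structure; labels are ignored."""
    representatives = {}
    for labels, edges in patterns:
        key = structure_key(len(labels), edges)
        representatives.setdefault(key, (labels, edges))
    return list(representatives.values())


# Made-up patterns: two 3-node paths with different labels, one triangle.
patterns = [
    (["A", "G", "L"], [(0, 1), (1, 2)]),
    (["V", "I", "L"], [(0, 2), (2, 1)]),
    (["A", "G", "L"], [(0, 1), (1, 2), (0, 2)]),
]
selected = one_per_structure(patterns)
print(len(selected))
```

Because node labels are ignored, the two paths fall into the
same group and only the first of them is kept, alongside the
triangle.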

5.4 Size-based Distribution of Patterns 

In this section, we study the distribution of patterns 
according to their size (number of edges), in order to check 
which pattern sizes are most affected by the selection. 
Figures 6 and 7 show the distribution of patterns for the 
original set of frequent subgraphs and for the final set of 
representative unsubstituted subgraphs, for all the 
substitution thresholds using Blosum62. The downward tendency 
of UnSubPatt at lower substitution thresholds, relative to 
the original set of frequent subgraphs, is very clear. In 
fact, UnSubPatt tends to cut off the peaks and flatten 
the curves at lower substitution thresholds. Another 
interesting observation is that the curves are flattened in 
the regions of small patterns as well as in those of large 
and dense patterns. This indicates that UnSubPatt is 
effective with both small and large patterns. 

Table 4: Number of subgraphs (#SG) and accuracy (ACC) of the 
classification of each dataset using NB coupled with frequent 
subgraphs (FSg), then with representative unsubstituted 
subgraphs using Blosum80 (UnSubPatt_Blosum80) and Pam250 
(UnSubPatt_Pam250). 

Figure 3: Classification accuracy by NB. 

Figure 4: Classification accuracy by SVM. 

Figure 5: Classification accuracy by C4.5. 

Figure 6: Distribution of patterns of DS1 for all the 
frequent subgraphs and for the representative 
unsubstituted ones with all the substitution thresholds. 

Figure 7: Distribution of patterns of DS2 for all the 
frequent subgraphs and for the representative 
unsubstituted ones with all the substitution thresholds. 
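
A size-based comparison of this kind can be sketched as follows.
The sketch assumes that each pattern is represented by its list
of edges. The helper names are hypothetical and the example data
are made up.

```python
from collections import Counter


def size_distribution(patterns):
    """Map each pattern size (number of edges) to the number of patterns."""
    return Counter(len(edges) for edges in patterns)


def kept_fraction_by_size(original, selected):
    """Fraction of the original patterns of each size that were selected."""
    before = size_distribution(original)
    after = size_distribution(selected)  # a Counter yields 0 for absent sizes
    return {size: after[size] / before[size] for size in sorted(before)}


# Made-up patterns: each pattern is a list of (u, v) edges.
original = [
    [(0, 1)],
    [(0, 1)],
    [(0, 1), (1, 2)],
    [(0, 1), (1, 2)],
    [(0, 1), (1, 2), (0, 2)],
]
selected = [
    [(0, 1)],
    [(0, 1), (1, 2)],
    [(0, 1), (1, 2), (0, 2)],
]
print(kept_fraction_by_size(original, selected))
```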

5.5 Comparison with other approaches 

To evaluate our approach objectively, we compare it with 
recent subgraph selection approaches. In Figure 8, we report 
the classification accuracy obtained using the representative 
unsubstituted patterns of UnSubPatt alongside that obtained 
using the patterns of other recent subgraph selection 
approaches from the literature, namely LEAP [28], gPLS [20], 
COM [16] and LPGBCMP [10] (reported and explained in the 
introductory section). 

Table 5: Runtime analysis of UnSubPatt with different 
substitution thresholds. 

Figure 8: Classification accuracy comparison with 
other pattern selection approaches (legend: UnSubPatt+SVM, 
LEAP+SVM, gPLS, COM, LPGBCMP). 

For UnSubPatt, we report the results obtained using 
the substitution matrix Blosum62, a minimum substitution 
threshold r = 30% and SVM for classification. For LEAP+SVM, 
LEAP is used iteratively to discover a set of discriminative 
subgraphs with a leap length of 0.1. The discovered subgraphs 
are used as features to train an SVM with 5-fold 
cross-validation. COM is used with tp = 30% and tn = 0%. 
For gPLS, the frequency threshold is set to 30%, and the best 
accuracy over all parameter combinations with m = 2, 4, 8, 16 
and k = 2, 4, 8, 16 is reported for each dataset, where m is 
the number of iterations and k is the number of patterns 
obtained per search. For LPGBCMP, threshold values of 
maxvar = 1 and S = 0.25 were used for feature consistency map 
building and for overlapping, respectively. 
The obtained results are reported in Figure 8. 
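
Reporting the best accuracy over all parameter combinations, as
done for gPLS, amounts to a small grid search over 16 (m, k)
pairs. A minimal sketch follows. The callable evaluate is a
hypothetical stand-in for training and cross-validating a model
with the given m and k, and the toy scoring function is made up.

```python
from itertools import product

GRID = (2, 4, 8, 16)


def best_over_grid(evaluate, m_values=GRID, k_values=GRID):
    """Return (best_accuracy, m, k) over all (m, k) combinations."""
    return max((evaluate(m, k), m, k) for m, k in product(m_values, k_values))


def toy_evaluate(m, k):
    """Made-up score peaking at m = 8, k = 4; not a real evaluation."""
    return 1.0 - 0.01 * (abs(m - 8) + abs(k - 4))


print(best_over_grid(toy_evaluate))
```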

The classification results displayed in Figure 8 show that 
UnSubPatt yields better classification accuracy than all the 
other pattern selection methods in the four cases. These 
results alone do not establish that UnSubPatt would always 
outperform the considered methods. However, they indicate 
that UnSubPatt is a very competitive and promising approach. 
It is also worth noting that these approaches are supervised 
and dedicated to classification, unlike UnSubPatt, which is 
unsupervised. It can therefore be used in classification as 
well as in other mining tasks such as clustering and indexing. 

5.6 Runtime Analysis 

To study how UnSubPatt's runtime varies with larger 
amounts of data, we use sets of frequent patterns ranging 
from 10000 to 100000 patterns with a step size of 10000. In 
Table 5, we report the runtime results for these pattern sets 
using three substitution thresholds. 

Despite the complexity of the problem, which is due to the 
combinatorial test of substitution between subgraphs, our 
algorithm scales to larger amounts of data: as the number 
of patterns increases, the runtime remains reasonable. 
The choice of substitution threshold only slightly affected 
the runtime of UnSubPatt, since the number of selected 
patterns is comparable for all thresholds. 

A possible way to make UnSubPatt run faster is 
parallelization. In fact, UnSubPatt can be easily 
parallelized, since it tests substitution separately within 
each group of subgraphs having the same size and order. 
Hence, these groups can be distributed and handled 
independently in different processes. 
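
The grouping argument above can be sketched as follows. This is
an illustration under stated assumptions, not the UnSubPatt
implementation: patterns are (labels, edges) pairs, order is
taken as the number of nodes and size as the number of edges,
and keep_first is a hypothetical placeholder for the per-group
substitution test. A thread pool keeps the example
self-contained. For CPU-bound work in CPython, a
ProcessPoolExecutor would be the natural substitute, provided
the per-group function is defined at module level.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_order_and_size(patterns):
    """Group patterns by (number of nodes, number of edges)."""
    groups = defaultdict(list)
    for labels, edges in patterns:
        groups[(len(labels), len(edges))].append((labels, edges))
    return groups


def select_in_parallel(patterns, select_group, executor):
    """Apply select_group to each group independently and merge the results."""
    groups = group_by_order_and_size(patterns)
    keys = sorted(groups)  # fixed order so that the output is deterministic
    selected = []
    for chosen in executor.map(select_group, [groups[k] for k in keys]):
        selected.extend(chosen)
    return selected


def keep_first(group):
    """Placeholder for the per-group substitution test: keeps one pattern."""
    return group[:1]


# Made-up patterns: two share (order, size) = (2, 1), the others are alone.
patterns = [
    (["A", "G"], [(0, 1)]),
    (["V", "L"], [(0, 1)]),
    (["A", "G", "L"], [(0, 1), (1, 2)]),
    (["A", "G", "L"], [(0, 1), (1, 2), (0, 2)]),
]
with ThreadPoolExecutor(max_workers=4) as pool:
    representatives = select_in_parallel(patterns, keep_first, pool)
print(len(representatives))
```

No group needs information from another group, so the only
shared step is merging the per-group results at the end.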

6. CONCLUSION 

In this paper, we proposed a novel selection approach for 
mining a representative summary of the set of frequent 
subgraphs. Unlike existing methods that are based on the 
relations between patterns in the transaction space, our 
approach considers the distance between patterns in the 
pattern space. The proposed approach exploits specific domain 
knowledge, in the form of a substitution matrix, to select a 
subset of representative unsubstituted patterns from 
a given set of frequent subgraphs. It also considerably 
reduces the size of the initial set of subgraphs, yielding 
an interesting and representative subset that enables easier 
and more efficient further exploration. It is also worth 
mentioning that our approach can be used on graphs as well as 
on sequences and is not limited to classification tasks: it 
can help in other subgraph-based analyses such as indexing, 
clustering and visual inspection. 

Since the proposed approach is a filter approach, a promising 
future direction is to integrate the selection within the 
extraction process in order to mine the representative 
patterns directly from the data. Moreover, we 
intend to use our approach in other classification contexts 
and in other mining applications. 
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